Short Text Conceptualization Using a Probabilistic Knowledgebase

Authors

  • Yangqiu Song
  • Haixun Wang
  • Zhongyuan Wang
  • Hongsong Li
  • Weizhu Chen
Abstract

Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclusions can be drawn easily. In this paper, we improve text understanding by using a probabilistic knowledgebase that is as rich as our mental world in terms of the concepts (of worldly facts) it contains. We then develop a Bayesian inference mechanism to conceptualize words and short text. We conducted comprehensive experiments on conceptualizing textual terms, and clustering short pieces of text such as Twitter messages. Compared to purely statistical methods such as latent semantic topic modeling or methods that use existing knowledgebases (e.g., WordNet, Freebase and Wikipedia), our approach brings significant improvements in short text understanding as reflected by the clustering accuracy.
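The abstract does not spell out the inference mechanism, so the following is only a minimal sketch of one way Bayesian conceptualization could look, assuming a naive Bayes style score P(c | terms) ∝ P(c) ∏ P(t | c) over instance-concept co-occurrence counts. The `ISA_COUNTS` table, the `conceptualize` function, and the `smoothing` parameter are hypothetical names with invented toy numbers; they are not taken from the paper or from any real knowledgebase.

```python
import math
from collections import defaultdict

# Hypothetical toy counts n(instance, concept); invented for illustration only.
ISA_COUNTS = {
    ("apple", "fruit"): 60,
    ("apple", "company"): 40,
    ("apple", "brand"): 5,
    ("microsoft", "company"): 90,
    ("microsoft", "brand"): 10,
    ("pear", "fruit"): 50,
}


def conceptualize(terms, counts=ISA_COUNTS, smoothing=1e-3):
    """Return a posterior over concepts for a list of terms (naive Bayes assumption)."""
    # Total count per concept, used for both the prior and the likelihoods.
    concept_totals = defaultdict(float)
    for (_, concept), n in counts.items():
        concept_totals[concept] += n
    grand_total = sum(concept_totals.values())
    vocab_size = len({instance for instance, _ in counts})

    log_scores = {}
    for concept, total in concept_totals.items():
        # Log prior: log P(c).
        score = math.log(total / grand_total)
        for term in terms:
            n = counts.get((term, concept), 0)
            # Additively smoothed log likelihood: log P(term | c).
            score += math.log((n + smoothing) / (total + smoothing * vocab_size))
        log_scores[concept] = score

    # Normalize in log space (log-sum-exp) to obtain a proper posterior.
    peak = max(log_scores.values())
    norm = sum(math.exp(s - peak) for s in log_scores.values())
    return {c: math.exp(s - peak) / norm for c, s in log_scores.items()}


if __name__ == "__main__":
    for text in (["apple"], ["apple", "microsoft"], ["apple", "pear"]):
        posterior = conceptualize(text)
        print(text, max(posterior, key=posterior.get))
```

With these toy counts, "apple" on its own leans toward the concept "fruit", while adding "microsoft" shifts the most probable concept to "company". This is the kind of context-driven disambiguation that a bag-of-words representation cannot express for very short texts.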


Related resources

Context-Dependent Conceptualization

Conceptualization seeks to map a short text (i.e., a word or a phrase) to a set of concepts as a mechanism of understanding text. Most prior research in conceptualization uses human-crafted knowledge bases that map instances to concepts. Such approaches to conceptualization have the limitation that the mappings are not context sensitive. To overcome this limitation, we propose a framework in...


How to Make a Semantic Network Probabilistic

Words and phrases associate with each other to form a semantic network. Characterizing such associations is a first step toward understanding natural languages for machines. Psychologists and linguists have used concepts such as typicality and basic level conceptualization to characterize such associations. However, how to quantify such concepts is an open problem. Recently, much work has focus...


Open Domain Short Text Conceptualization: A Generative + Descriptive Modeling Approach

Concepts embody the knowledge to facilitate our cognitive processes of learning. Mapping short texts to a large set of open domain concepts has gained many successful applications. In this paper, we unify the existing conceptualization methods from a Bayesian perspective, and discuss the three modeling approaches: descriptive, generative, and discriminative models. Motivated by the discussion o...


Promoting Precision Cancer Medicine through a Community-Driven Knowledgebase

Increasing efforts are being dedicated towards improving cancer care via personalized medicine. These efforts depend to a large degree on the availability of a knowledge foundation. Unfortunately, existing knowledge linking cancer drugs and potential efficacy biomarkers is in its infancy; and where links are known, they are frequently unstructured and poorly documented. We have developed a new ...


The Impact of Input Enrichment in Long Text vs. Short Texts on Grammatical Accuracy in Writing Among Elementary Language Learners

This study was conducted to investigate the influence of teaching accurate grammar in writing via enriched long text and short text for the elementary students at Shokouhe_Farhang institute. The homogenized subjects were divided into two groups of 18 and 17 participants. A writing exam was used as a pretest to check the students' knowledge of the English past tense. The control group received the...



Publication date: 2011